155 research outputs found
Decreasing log data of multi-tier services for effective request tracing
Previous work shows that request tracing systems help understand and debug the
performance problems of multi-tier services. However, in large-scale data
centers, hundreds of thousands of service instances provide online service at
the same time, and previous white-box or black-box tracing systems produce a
large amount of log data, which is correlated into large quantities of causal
paths for performance debugging. In this paper, we propose an innovative
algorithm to eliminate valueless logs of multi-tier services. Our experiments
show that our method filters out 84% of valueless causal paths and is promising
for use in large-scale data centers.
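The abstract does not spell out the filtering criterion, so the following minimal Python sketch is only a hypothetical illustration of the kind of reduction described: group causal paths by their pattern, keep one representative per pattern, and keep any path whose latency deviates strongly from its pattern's distribution.

```python
from collections import defaultdict
from statistics import mean, stdev

def filter_causal_paths(paths, keep_sigma=3.0):
    """Drop 'valueless' causal paths (hypothetical heuristic, not the paper's
    algorithm). For each path pattern (the sequence of tiers traversed), keep
    one representative plus any path whose end-to-end latency deviates strongly
    from that pattern's distribution.
    `paths` is a list of dicts: {"pattern": tuple_of_tiers, "latency_ms": float}."""
    by_pattern = defaultdict(list)
    for p in paths:
        by_pattern[p["pattern"]].append(p)

    kept = []
    for pattern, group in by_pattern.items():
        latencies = [p["latency_ms"] for p in group]
        mu = mean(latencies)
        sigma = stdev(latencies) if len(latencies) > 1 else 0.0
        kept.append(group[0])  # one representative per pattern
        kept.extend(p for p in group[1:]
                    if sigma and abs(p["latency_ms"] - mu) > keep_sigma * sigma)
    return kept
```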
Characterization and Architectural Implications of Big Data Workloads
Big data areas are expanding rapidly in terms of workloads and runtime systems,
and this situation poses a serious challenge to workload characterization,
which is the foundation of innovative system and architecture design. Previous
major efforts on big data benchmarking either propose a comprehensive but very
large set of workloads, or select only a few workloads according to so-called
popularity, which may lead to partial or even biased observations. In this
paper, on the basis of a comprehensive big data benchmark
suite---BigDataBench, we reduce 77 workloads to 17 representative workloads
from a micro-architectural perspective. On a typical state-of-practice
platform---Intel Xeon E5645, we compare the representative big data workloads
with SPECINT, SPECFP, PARSEC, CloudSuite and HPCC. After a comprehensive
workload characterization, we make the following observations. First, the big
data workloads are data-movement-dominated computing with more branch
operations, accounting for up to 92% of the instruction mix, which places them
in a different class from desktop (SPEC CPU2006), CMP (PARSEC), and HPC (HPCC)
workloads. Second, corroborating previous work, Hadoop- and Spark-based big
data workloads have higher front-end stalls. Compared with traditional
workloads such as PARSEC, the big data workloads have a larger instruction
footprint. But we also note that, in addition to varied instruction-level
parallelism, there are significant disparities in front-end efficiency among
different big data workloads. Third, we find that complex software stacks that
fail to use state-of-practice processors efficiently are one of the main
factors leading to high front-end stalls. For the same workloads, the L1I cache
miss rates differ by an order of magnitude among implementations with different
software stacks.
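For readers who want to reproduce this style of characterization, the reported numbers are simple ratios over hardware performance-counter totals (e.g. collected with `perf stat`). A minimal sketch follows; the counter names are illustrative placeholders, not the exact events used in the paper.

```python
def characterize(counters):
    """Compute the derived micro-architectural metrics discussed above from raw
    hardware-counter totals. `counters` maps event names to counts; the names
    here are illustrative placeholders, not the paper's exact events."""
    instructions = counters["instructions"]
    cycles = counters["cycles"]
    return {
        # fraction of the instruction mix that is loads, stores and branches
        # ("data-movement-dominated computing with more branch operations")
        "data_movement_and_branch_ratio":
            (counters["loads"] + counters["stores"] + counters["branches"])
            / instructions,
        # front-end stall ratio: cycles in which the front end supplied no uops
        "frontend_stall_ratio": counters["frontend_stall_cycles"] / cycles,
        # L1 instruction-cache misses per kilo-instructions (L1I MPKI)
        "l1i_mpki": counters["l1i_misses"] / instructions * 1000,
        "ipc": instructions / cycles,
    }

print(characterize({
    "instructions": 1.0e12, "cycles": 1.4e12,
    "loads": 3.0e11, "stores": 1.5e11, "branches": 2.0e11,
    "frontend_stall_cycles": 4.2e11, "l1i_misses": 8.0e9,
}))
```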
Automatic Performance Debugging of SPMD-style Parallel Programs
The single program, multiple data (SPMD) programming model is widely used
for both high performance computing and Cloud computing. In this paper, we
design and implement an innovative system, AutoAnalyzer, that automates the
process of debugging performance problems of SPMD-style parallel programs,
including data collection, performance behavior analysis, locating bottlenecks,
and uncovering their root causes. AutoAnalyzer is unique in terms of two
features: first, without any a priori knowledge, it automatically locates
bottlenecks and uncovers their root causes for performance optimization;
second, it is lightweight in terms of the size of performance data to be
collected and analyzed. Our contributions are three-fold: first, we propose two
effective clustering algorithms to investigate the existence of performance
bottlenecks that cause process behavior dissimilarity or code region behavior
disparity, respectively; meanwhile, we present two searching algorithms to
locate bottlenecks; second, on the basis of rough set theory, we propose an
innovative approach to automatically uncovering root causes of bottlenecks;
third, on the cluster systems with two different configurations, we use two
production applications, written in Fortran 77, and one open source
code-MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the
effectiveness and correctness of our methods. For the three applications, we
also propose an experimental approach to investigating the effects of different
metrics on locating bottlenecks.
Comment: 16 pages, 23 figures. Accepted by the Journal of Parallel and Distributed Computing (JPDC).
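The clustering and searching algorithms are detailed in the paper itself; the short sketch below only illustrates the general idea of detecting process-behavior dissimilarity by clustering per-process metric vectors (a generic k-means here, not necessarily AutoAnalyzer's algorithms).

```python
import numpy as np
from sklearn.cluster import KMeans

def find_dissimilar_processes(metrics, n_clusters=2):
    """`metrics` is an (n_processes, n_metrics) array: one row per MPI process,
    with columns such as per-code-region wall time, communication time, or
    cache misses. In a well-balanced SPMD run all processes should fall into
    one cluster; a small outlying cluster points at processes worth inspecting."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(metrics)
    counts = np.bincount(labels)
    minority = counts.argmin()  # smallest cluster = suspicious processes
    return np.where(labels == minority)[0], labels
```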
Automatic Performance Debugging of SPMD Parallel Programs
Automatic performance debugging of parallel applications usually involves two
steps: automatic detection of performance bottlenecks and uncovering their root
causes for performance optimization. Previous work fails to resolve this
challenging issue in several ways: first, several previous efforts automate
analysis processes, but present the results in a confined way that only
identifies performance problems with a priori knowledge; second, several tools
use exploratory or confirmatory data analysis to automatically discover
relevant performance data relationships. However, these efforts do not focus on
locating performance bottlenecks or uncovering their root causes. In this
paper, we design and implement an innovative system, AutoAnalyzer, to
automatically debug the performance problems of single program, multiple data
(SPMD) parallel programs. Our system is unique in terms of two dimensions:
first, without any a priori knowledge, we automatically locate bottlenecks and
uncover their root causes for performance optimization; second, our method is
lightweight in terms of the size of the collected and analyzed performance data. Our
contribution is three-fold. First, we propose a set of simple performance
metrics to represent behavior of different processes of parallel programs, and
present two effective clustering and searching algorithms to locate
bottlenecks. Second, we propose to use the rough set algorithm to automatically
uncover the root causes of bottlenecks. Third, we design and implement the
AutoAnalyzer system, and use two production applications to verify the
effectiveness and correctness of our methods. According to the analysis results
of AutoAnalyzer, we optimize two parallel programs with performance
improvements ranging from 20% to 170%.
Comment: The preliminary version appeared at the SC 08 Workshop on Node Level Parallelism for Large Scale Supercomputers. The web site is
http://iss.ices.utexas.edu/sc08nlplss/program.htm
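As a loose illustration of how rough set theory can tie bottleneck classes to the metrics responsible for them, the sketch below computes the standard rough-set dependency degree and per-attribute significance over a small, made-up decision table; this is not AutoAnalyzer's exact procedure.

```python
from collections import defaultdict

def dependency_degree(rows, cond_attrs, dec_attr):
    """Rough-set dependency degree gamma(C, D): the fraction of rows whose
    decision value is fully determined by their condition-attribute values."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in cond_attrs)].add(i)
    positive = sum(len(ids) for ids in blocks.values()
                   if len({rows[i][dec_attr] for i in ids}) == 1)
    return positive / len(rows)

def attribute_significance(rows, cond_attrs, dec_attr):
    """Significance of each condition attribute: how much the dependency degree
    drops when that attribute is removed. Large drops mark the metrics that
    'explain' the bottleneck class."""
    full = dependency_degree(rows, cond_attrs, dec_attr)
    return {a: full - dependency_degree(rows, [b for b in cond_attrs if b != a], dec_attr)
            for a in cond_attrs}

# Made-up discretized records: one per code region, decision = bottleneck class.
rows = [
    {"cache_miss": "high", "comm_ratio": "low",  "bottleneck": "memory"},
    {"cache_miss": "high", "comm_ratio": "high", "bottleneck": "communication"},
    {"cache_miss": "low",  "comm_ratio": "high", "bottleneck": "communication"},
    {"cache_miss": "low",  "comm_ratio": "low",  "bottleneck": "none"},
]
print(attribute_significance(rows, ["cache_miss", "comm_ratio"], "bottleneck"))
# -> comm_ratio has the higher significance for this toy table
```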
AccuracyTrader: Accuracy-aware Approximate Processing for Low Tail Latency and High Result Accuracy in Cloud Online Services
Modern latency-critical online services such as search engines often process
requests by consulting large input data spanning massive parallel components.
Hence the tail latency of these components determines the service latency. To
trade off result accuracy for tail latency reduction, existing techniques use
the components responding before a specified deadline to produce approximate
results. However, they may skip a large proportion of components when load gets
heavier, thus incurring large accuracy losses. This paper presents
AccuracyTrader that produces approximate results with small accuracy losses
while maintaining low tail latency. AccuracyTrader aggregates information of
input data on each component to create a small synopsis, thus enabling all
components to produce initial results quickly using their synopses.
AccuracyTrader also uses the synopses to identify the parts of the input data
most relevant to a request's result accuracy, and uses these parts first to
improve the produced results in order to minimize accuracy losses. We evaluated
AccuracyTrader using workloads in real services. The results show: (i)
AccuracyTrader reduces tail latency by over 40 times with accuracy losses of
less than 7% compared to existing exact processing techniques; (ii) at the
same latency, AccuracyTrader reduces accuracy losses by over 13 times compared
to existing approximate processing techniques.
Comment: 10 pages, 8 figures, 2 tables
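As a rough, hypothetical illustration of the two-phase idea (not the paper's implementation): each component keeps a small synopsis of its data shard, answers immediately from the synopsis ranking, and then refines the result using the most relevant parts of the shard first, until a deadline.

```python
import time

class Component:
    """One parallel component holding a shard of documents. The 'synopsis' here
    is simply a per-bucket keyword set; the real system's synopsis design is
    more elaborate."""
    def __init__(self, shard, bucket_size=100):
        self.buckets = [shard[i:i + bucket_size]
                        for i in range(0, len(shard), bucket_size)]
        self.synopsis = [set(w for doc in bucket for w in doc.split())
                         for bucket in self.buckets]

    def search(self, query, deadline_s):
        start = time.monotonic()
        terms = set(query.split())
        # Phase 1: an initial, approximate answer straight from the synopsis
        # (bucket indices ranked by keyword overlap) -- always fast.
        ranked = sorted(range(len(self.buckets)),
                        key=lambda i: -len(self.synopsis[i] & terms))
        hits = []
        # Phase 2: refine by scanning the most relevant buckets first,
        # stopping when the deadline is reached.
        for i in ranked:
            if time.monotonic() - start > deadline_s:
                break
            hits.extend(doc for doc in self.buckets[i] if terms <= set(doc.split()))
        return hits

# usage: Component(["big data tracing", "dvfs power saving"]).search("big data", 0.01)
```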
PowerTracer: Tracing requests in multi-tier services to save cluster power consumption
As energy proportional computing gradually extends the success of DVFS
(Dynamic voltage and frequency scaling) to the entire system, DVFS control
algorithms will play a key role in reducing server clusters' power consumption.
The focus of this paper is to provide accurate cluster-level DVFS control for
power saving in a server cluster. To achieve this goal, we propose a request
tracing approach that online classifies the major causal path patterns of a
multi-tier service and monitors their performance data as a guide for accurate
DVFS control. The request tracing approach significantly decreases the time
cost of performance profiling experiments that aim to establish the empirical
performance model. Moreover, it decreases the controller complexity so that we
can introduce a much simpler feedback controller, which modulates the DVFS
setting of a single node at a time rather than varying multiple CPU frequencies
simultaneously. Based on the request tracing approach, we present a
hybrid DVFS control system that combines an empirical performance model for
fast modulation at different load levels and a simpler feedback controller for
adaption. We implement a prototype of the proposed system, called PowerTracer,
and conduct extensive experiments on a 3-tier platform. Our experimental
results show that PowerTracer outperforms its peer in terms of power saving and
system performance.
Comment: 10 pages, 22 figures
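The hybrid controller itself is described only at a high level; the sketch below shows the general shape of one single-node feedback step (an illustrative proportional controller, not PowerTracer's actual control law), choosing the next CPU frequency for one node from the measured latency of the dominant causal path pattern.

```python
def next_frequency(freq_hz, measured_latency_ms, target_latency_ms,
                   available_freqs_hz, gain=0.5):
    """One feedback step for single-node DVFS modulation: scale this node's
    frequency in proportion to the relative latency error, then snap to the
    nearest frequency the CPU actually supports.
    Illustrative proportional controller, not PowerTracer's control law."""
    error = (measured_latency_ms - target_latency_ms) / target_latency_ms
    desired = freq_hz * (1.0 + gain * error)  # run faster when too slow, slower when slack
    return min(available_freqs_hz, key=lambda f: abs(f - desired))

# e.g. latency 20% above target nudges a 1.8 GHz core toward the 2.0 GHz step
print(next_frequency(1.8e9, 120, 100, [1.2e9, 1.6e9, 1.8e9, 2.0e9, 2.4e9]))
```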
Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster
In warehouse-scale cloud datacenters, co-locating online services and offline
batch jobs is an efficient approach to improving datacenter utilization. To
better facilitate the understanding of interactions among the co-located
workloads and their real-world operational demands, Alibaba recently released a
cluster usage and co-located workload dataset, which is the first publicly
available dataset with precise information about the category of each job. In this paper,
we perform a deep analysis on the released Alibaba workload dataset, from the
perspective of anomaly analysis and diagnosis. Through data preprocessing, node
similarity analysis based on Dynamic Time Warping (DTW), co-located workload
characteristics analysis, and anomaly analysis based on iForest, we reveal
several insights, including: (1) the performance discrepancy among machines in
Alibaba's production cluster is relatively large, because the distribution and
resource utilization of co-located workloads are not balanced; for instance,
the resource utilization (especially memory utilization) of batch jobs
fluctuates and is not as stable as that of online containers, since online
containers are long-running, memory-demanding jobs while most batch jobs are
short-lived; (2) based on the distribution of co-located workload instance
numbers, the machines can be classified into 8 workload distribution
categories, and most machine resource utilization curves follow similar
patterns within the same category; (3) in addition to system failures,
unreasonable scheduling and workload imbalance are the main causes of anomalies
in Alibaba's cluster.
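As a concrete, simplified illustration of the analysis pipeline named above (DTW for node similarity plus iForest for anomaly detection, using scikit-learn's IsolationForest as the iForest implementation; the feature set here is assumed, not the paper's):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def dtw_distance(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping distance between two
    utilization time series (e.g. per-machine CPU usage curves)."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

# Anomaly detection over per-machine feature vectors; the feature choice is
# illustrative (mean/std of CPU and memory utilization per machine).
features = np.array([
    [0.55, 0.05, 0.60, 0.04],
    [0.53, 0.06, 0.62, 0.05],
    [0.95, 0.30, 0.98, 0.25],   # an unbalanced, overloaded machine
    [0.54, 0.05, 0.61, 0.04],
])
labels = IsolationForest(contamination=0.25, random_state=0).fit_predict(features)
print(labels)   # -1 marks machines flagged as anomalous
```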
PhoenixCloud: Provisioning Resources for Heterogeneous Workloads in Cloud Computing
As more and more service providers choose Cloud platforms, which are provided
by third-party resource providers, resource providers need to provision
resources for heterogeneous workloads in different Cloud scenarios. Taking into
account the dramatic differences among heterogeneous workloads, can we
coordinately provision resources for them in Cloud computing? In this paper we
focus on this important issue, which few previous works have investigated. Our
contributions are threefold: (1) we propose coordinated resource provisioning
solutions for heterogeneous workloads in two typical Cloud scenarios: first, a
large organization operates a private Cloud for two heterogeneous workloads;
second, a large organization or two service providers running heterogeneous
workloads resort to a public Cloud; (2) we build an agile system, PhoenixCloud, that
enables a resource provider to create coordinated runtime environments on
demand for heterogeneous workloads when they are consolidated on a Cloud site;
and (3) we perform a comprehensive experimental evaluation. For two typical
heterogeneous workload traces, parallel batch jobs and Web services, our
experiments show that: a) in the private Cloud scenario, while keeping the
throughput almost the same as that of a dedicated cluster system, our solution
decreases the configuration size of a cluster by about 40%; b) in the public
Cloud scenario, our solution decreases not only the total resource consumption
but also the peak resource consumption, by up to 31% with respect to the
EC2 + RightScale solution.
Comment: 18 pages. This is an extended version of our CCA 08 paper (The First
Workshop of Cloud Computing and its Application, CCA08, Chicago, 2008): J.
Zhan, L. Wang, B. Tu, Y. Li, P. Wang, W. Zhou, D. Meng. 2008. Phoenix Cloud:
Consolidating Different Computing Loads on Shared Cluster System for Large
Organization. The modified version can be found on
http://arxiv.org/abs/0906.134
HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems
With tremendous growing interest in Big Data systems, analyzing and
facilitating their performance improvement becomes increasingly important.
Although there have been many research efforts on improving the performance of
Big Data systems, efficiently analyzing and diagnosing performance bottlenecks
over these massively distributed systems remains a major challenge. In this
paper, we propose a spatio-temporal correlation analysis approach based on the
stage and distribution characteristics of Big Data applications, which
associates multi-level performance data at a fine granularity. On the basis of
the correlated data, we define a priori rules, select features, and vectorize
the corresponding datasets for different performance bottlenecks, such as
workload imbalance, data skew, abnormal nodes, and outlier metrics. We then
utilize data- and model-driven algorithms for bottleneck detection and
diagnosis. In addition, we design and develop a lightweight, extensible tool,
HybridTune, and validate its diagnosis effectiveness with BigDataBench on
several benchmark experiments, in which it outperforms state-of-the-art
methods. Our experiments show that the accuracy of our abnormal/outlier
detection reaches about 80%. Finally, we report several Spark and Hadoop use
cases that demonstrate how HybridTune supports users in carrying out
performance analysis and diagnosis efficiently on Spark and Hadoop
applications, and our experience shows that HybridTune can help users find
performance bottlenecks and provides optimization recommendations.
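The a priori rules are not listed in the abstract; the sketch below shows the general flavor of such rule-based detectors over per-task, per-stage metrics (the thresholds and rule forms are assumptions, not HybridTune's actual rules).

```python
import statistics

def detect_stage_bottlenecks(task_metrics, skew_ratio=2.0, straggler_ratio=1.5):
    """Rule-based checks over one stage's tasks. `task_metrics` is a list of
    dicts with per-task "duration_s", "input_bytes" and "host".
    The thresholds and rule forms are illustrative assumptions."""
    durations = [t["duration_s"] for t in task_metrics]
    inputs = [t["input_bytes"] for t in task_metrics]
    findings = []

    # Data skew: some task reads far more input than the median task.
    if max(inputs) > skew_ratio * statistics.median(inputs):
        findings.append("data skew")

    # Workload imbalance / stragglers: slowest task far exceeds the median duration.
    if max(durations) > straggler_ratio * statistics.median(durations):
        findings.append("workload imbalance")

    # Abnormal node: one host's tasks are consistently slower than the overall mean.
    by_host = {}
    for t in task_metrics:
        by_host.setdefault(t["host"], []).append(t["duration_s"])
    overall = statistics.mean(durations)
    for host, ds in by_host.items():
        if statistics.mean(ds) > straggler_ratio * overall:
            findings.append(f"abnormal node: {host}")
    return findings
```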
Comparison and Benchmarking of AI Models and Frameworks on Mobile Devices
Due to increasing amounts of data and compute resources, deep learning has
achieved many successes in various domains. The application of deep learning on
mobile and embedded devices is attracting more and more attention, and
benchmarking and ranking the AI abilities of these devices has become an urgent
problem to be solved. Considering model diversity and framework diversity,
we propose a benchmark suite, AIoTBench, which focuses on the evaluation of the
inference abilities of mobile and embedded devices. AIoTBench covers three
typical heavy-weight networks: ResNet50, InceptionV3, DenseNet121, as well as
three light-weight networks: SqueezeNet, MobileNetV2, MnasNet. Each network is
implemented by three frameworks which are designed for mobile and embedded
devices: Tensorflow Lite, Caffe2, Pytorch Mobile. To compare and rank the AI
capabilities of the devices, we propose two unified metrics as the AI scores:
Valid Images Per Second (VIPS) and Valid FLOPs Per Second (VOPS). Currently, we
have compared and ranked 5 mobile devices using our benchmark. This list will
be extended and updated soon.
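The abstract does not give the exact formulas for the two AI scores; a plausible reading (an assumption on our part, not the paper's definition) is that both weight raw throughput by result validity, i.e. by classification accuracy.

```python
def ai_scores(images_processed, total_flops, wall_time_s, top1_accuracy):
    """A plausible reading of the two AI scores (assumption, not the paper's
    exact definition): throughput weighted by how many results are 'valid',
    i.e. correctly classified."""
    vips = images_processed * top1_accuracy / wall_time_s    # Valid Images Per Second
    vops = total_flops * top1_accuracy / wall_time_s         # Valid FLOPs Per Second
    return vips, vops

# e.g. 1000 images through a model costing ~4 GFLOPs each, 72% top-1 accuracy, 25 s
print(ai_scores(1000, 1000 * 4e9, 25.0, 0.72))
```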
- …